library(dplyr)      # glimpse, %>%, count, filter
library(tidytext)   # unnest_tokens, bind_tf_idf, cast_dtm, tidy
library(topicmodels) # LDA
library(ggplot2)

data = read.csv(file = "ccn_reind.csv", sep = ";", encoding = "UTF-8", stringsAsFactors = FALSE)
head(data)
glimpse(data)
## Observations: 1,855
## Variables: 5
## $ ind <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ date <chr> "2018-04-30 18:13:00", "2018-04-30 17:26:00", "2018-04...
## $ title <chr> "Mastercard Seeks Patent For Fast Tracking Blockchain ...
## $ article <chr> " Mastercard International Inc. has applied for a pate...
## $ url <chr> "https://www.ccn.com/mastercard-international-seeks-pa...
At this point we could stem the documents (the article column) so that, for example, cryptocurrency and cryptocurrencies would be treated as the same word. To speed up the computation we skip this step here; the code below shows how to do it:
# library("SnowballC")
# library(tm)
# data$article = stemDocument(data$article)
We can remove the URL and date columns (unnecessary for this analysis):
data$url = NULL
data$date = NULL
glimpse(data)
## Observations: 1,855
## Variables: 3
## $ ind <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ title <chr> "Mastercard Seeks Patent For Fast Tracking Blockchain ...
## $ article <chr> " Mastercard International Inc. has applied for a pate...
Now, let’s convert this dataframe into a tidy format and remove stopwords:
by_title_word = data %>% unnest_tokens(word, article) %>% anti_join(stop_words)
## Joining, by = "word"
by_title_word
by_title_word %>% count(word, sort = TRUE)
Let’s add a column counting the occurrences of each word in each document, i.e. each article:
word_counts = by_title_word %>% count(title, word, sort = TRUE)
word_counts
To make the computation faster and to downweight obvious words, let’s remove irrelevant terms using tf-idf. Why? To avoid the dominance of words such as bitcoin and cryptocurrency, which occur in almost every document, and to remove very uncommon words that occur only a few times or in only a few documents. We do this with tidytext::bind_tf_idf:
desc_tf_idf = by_title_word %>% count(ind, title, word, sort = TRUE) %>% bind_tf_idf(word, ind, n) %>% arrange(-tf_idf) # bind_tf_idf takes term, document, n
desc_tf_idf %>% select(-title)
We could remove numbers, but they might be important: for example, dramatic currency value changes reported in the news. Maybe they will be clustered into one topic? Let’s keep them. What tf-idf value should we treat as the threshold for removing unimportant terms?
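As a quick aside, here is a minimal sketch (on made-up data, not part of the analysis above) of how bind_tf_idf scores terms: a term that appears in every document gets idf = ln(n_docs / n_docs) = 0, so its tf-idf is 0 no matter how frequent it is.

```r
# Toy example (invented data): two documents, one shared word, one unique word.
library(dplyr)
library(tidytext)

toy = tibble::tribble(
  ~doc, ~word,     ~n,
  "a",  "bitcoin",  3,
  "a",  "petro",    1,
  "b",  "bitcoin",  2
)
toy %>% bind_tf_idf(word, doc, n)
# "bitcoin" occurs in both documents, so idf = ln(2/2) = 0 and tf_idf = 0;
# "petro" occurs only in doc "a": tf = 1/4, idf = ln(2/1), tf_idf ≈ 0.173
```

This is exactly why ubiquitous words like bitcoin end up with low tf-idf in our corpus.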
c(min(desc_tf_idf$tf_idf), max(desc_tf_idf$tf_idf))
## [1] 0.0004398986 6.4508556598
ggplot(data = desc_tf_idf, aes(x = tf_idf)) + geom_histogram() #+ scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
nrow(desc_tf_idf)
## [1] 295839
paste("We would remove", nrow(desc_tf_idf) - desc_tf_idf %>% filter(tf_idf > 0.10) %>% nrow(), "rows and keep only", desc_tf_idf %>% filter(tf_idf > 0.10) %>% nrow())
## [1] "We would remove 176030 rows and keep only 119809"
small_tf_idf = desc_tf_idf %>% filter(tf_idf > 0.10)
small_tf_idf
Right now this data frame is in a tidy form: treating each article as one document, we have a one-term-per-document-per-row format. However, the topicmodels package requires a tm::DocumentTermMatrix. We can cast a one-token-per-row table into a DocumentTermMatrix with tidytext::cast_dtm:
dtm <- small_tf_idf %>% cast_dtm(title, word, n) # title = Document, word = term
dtm
## <<DocumentTermMatrix (documents: 1854, terms: 25745)>>
## Non-/sparse entries: 119763/47611467
## Sparsity : 100%
## Maximal term length: 43
## Weighting : term frequency (tf)
Now we are ready to use the topicmodels package to create an LDA model. Since LDA is unsupervised learning, it is difficult to choose the correct number of topics. Measures such as topic coherence can help find a good value; to learn more, see: https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html
In the following we fit two LDA models: one with 16 topics and one with 6.
lda16 <- LDA(dtm, k = 16, control = list(seed = 1234))
lda16
## A LDA_VEM topic model with 16 topics.
The tidytext package allows us to go back to a tidy format, using the tidy() and augment() verbs from the broom package. We start with the tidy() function:
tidy_lda16 = tidy(lda16)
head(tidy_lda16)
str(tidy_lda16)
## Classes 'tbl_df', 'tbl' and 'data.frame': 411920 obs. of 3 variables:
## $ topic: int 1 2 3 4 5 6 7 8 9 10 ...
## $ term : chr "8,039" "8,039" "8,039" "8,039" ...
## $ beta : num 4.46e-202 8.41e-202 5.86e-202 3.06e-202 3.45e-202 ...
This is a one-topic-per-term-per-row format. For each combination, the model returns beta, the probability of that term being generated from that topic. We can extract the top terms per topic with dplyr::top_n:
tidy_lda16 %>% group_by(topic) %>% arrange(-beta) %>% top_n(5, beta) # 5 top terms per 16 topics = 80 rows
Since the probabilities are not very high, let’s try just 6 topics in the second model:
lda6 <- LDA(dtm, k = 6, control = list(seed = 1234))
tidy_lda6 = tidy(lda6)
tidy_lda6 %>% group_by(topic) %>% arrange(-beta) %>% top_n(14, beta) # 14 top terms per 6 topics = 84 rows, a similar number of results
The probabilities are now even lower, so the model with 6 topics is probably worse. But it is interesting that topic 4 has similar words in both models. Let’s investigate:
tidy_lda6 %>% filter(topic == 4) %>% arrange(-beta)
tidy_lda16 %>% filter(topic == 4) %>% arrange(-beta)
We can see that the terms petro, oil, russian, venezuelan and maduro were assigned to the same topic in both models. Petro is a cryptocurrency developed by the government of Venezuela; it is claimed to be backed by the country’s oil and mineral reserves and is intended to supplement Venezuela’s devalued currency, the bolívar.
For a last attempt, let’s look at the probabilities for each topic when we opt for 18 topics:
lda18 <- LDA(dtm, k = 18, control = list(seed = 1234))
tidy_lda18 = tidy(lda18)
tidy_lda18 %>% group_by(topic) %>% arrange(-beta) %>% top_n(5, beta) # 5 top terms per 18 topics = 90 rows
Some probabilities went up. It seems that 18 topics produce a better model than lda6 and lda16. Let’s look at topic 4 again:
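Eyeballing beta values is a rough way to compare models. As a sketch (not part of the original analysis), we could also compare the three fitted models numerically with topicmodels::perplexity:

```r
# Sketch: compare the fitted models by perplexity on the same DTM.
# Lower perplexity means the model assigns higher likelihood to the data.
# Note: evaluating on the training data tends to favour larger k,
# so a held-out document-term matrix would give a fairer comparison.
library(topicmodels)
sapply(list(lda6 = lda6, lda16 = lda16, lda18 = lda18),
       function(m) perplexity(m, dtm))
```

This assumes the lda6, lda16, lda18 objects and dtm from the chunks above are still in scope.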
tidy_lda18 %>% filter(topic == 4) %>% arrange(-beta)
The same pattern! You may ask: why russian? The petro cryptocurrency was allegedly created in a half-hidden collaboration with the Russian government: http://time.com/5206835/exclusive-russia-petro-venezuela-cryptocurrency/
Going back to topic modelling, we can find and visualize the top terms per topic:
top_terms <- tidy_lda18 %>% group_by(topic) %>% top_n(5, beta) %>%
ungroup() %>% arrange(topic, -beta) # order by topic ascending, then beta descending
top_terms
Visualization:
top_terms %>% mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_bar(stat = "identity", show.legend = FALSE) + # stat = "identity": bar heights come from the beta values, not from row counts
coord_flip() +
facet_wrap(~ topic, ncol = 6, scales = "free") + # free scales, since the terms and beta ranges differ across topics
theme(axis.text.x = element_text(size = 6, angle = 90)) +
labs(title = "Top terms per topic in CCN cryptocurrency news", x = "word", y = "beta: probability of the word given the topic")
We can also find out which topics are associated with each document:
gamma_lda18 = tidy(lda18, matrix="gamma")
gamma_lda18 %>% arrange(-gamma)
The more words in a document are assigned to a topic, the more weight (gamma) goes to that document-topic classification. Let’s look at the distribution of gamma values (we plot the y axis on a log scale to see more detail; without it, almost all values appear to be 0 and a few appear to be 1):
ggplot(data = gamma_lda18, aes(x = gamma)) + geom_histogram() + scale_y_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
One important step in topic modeling is assigning each word in each document to a topic. We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the augment() function.
assignments <- augment(lda18, data = dtm)
str(assignments)
## Classes 'tbl_df', 'tbl' and 'data.frame': 119763 obs. of 4 variables:
## $ document: chr "Newsflash: Bitcoin Price Shoots Vertically Above $8,000" "Newsflash: Bitcoin Price Shoots Vertically Above $8,000" "Newsflash: Bitcoin Price Loses $3,000 in Major Correction, Regains Upward Run" "Newsflash: Bitcoin Price Loses $3,000 in Major Correction, Regains Upward Run" ...
## $ term : chr "8,039" "flung" "14,106" "14,106.32" ...
## $ count : num 1 1 1 1 1 1 1 1 1 1 ...
## $ .topic : num 15 15 9 9 9 5 15 15 15 1 ...
assignments
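One thing we could do with these assignments (a sketch, not part of the original analysis) is find each document’s consensus topic, i.e. the topic that most of its words were assigned to:

```r
# Sketch: for each document, find the consensus topic, i.e. the topic that
# most of its words (weighted by the count column) were assigned to.
library(dplyr)
assignments %>%
  count(document, .topic, wt = count) %>% # total word count per document-topic pair
  group_by(document) %>%
  top_n(1, n) %>%                         # keep the most frequent topic per document
  ungroup()
```

Comparing these consensus topics with the gamma-based classification above would show how strongly each article is dominated by a single topic.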